Caches are typically one of the hardest topics for students in Computer Architecture to grasp at first. This exercise uses the cache visualization tools in Venus, together with the file cache.s, to get you more familiar with cache behavior and performance terminology. At this point, read through cache.s to get a rough idea of what the program does.
To get started with each of the scenarios below:
Get familiar with the parameters in cache windows:
The Data Cache Simulator will show the state of your data cache. Remember that the simulator runs alongside your code, so resetting your code also resets the cache and memory state.
If you run the code all at once, you will get the final state of the cache and the hit rate. You will probably benefit the most from setting breakpoints at each memory access to see exactly where the hits and misses are coming from. To set a breakpoint in Venus, click the corresponding line in the simulator; if the line turns red, your program will stop when execution reaches that line.
Simulate the following scenarios and record the final cache hit rates. Try to reason out what the hit rate will be BEFORE running the code. After running each simulation, make sure you understand WHY you see what you see (the TAs will be asking)!
Do not hesitate to ask questions if you feel confused! This is perfectly normal and the TA is there to help you out!
Good questions to ask yourself as you do these exercises:
Cache Parameters:
Program Parameters:
Checkoff
Cache Parameters:
Program Parameters:
Checkoff
Cache Parameters:
Program Parameters:
Checkoff
In image processing, a Gaussian blur (also known as Gaussian smoothing) is the result of blurring an image by a Gaussian function. Mathematically, applying a Gaussian blur to an image is the same as convolving the image with a Gaussian function. In this lab, we adopt a 1-dimensional Gaussian distribution kernel, and the blurring process is done in two steps. Given image A as our input, we first convolve the kernel over the rows of image A to produce a horizontally blurred image B. We then convolve the kernel over the columns of image B to produce a horizontally and vertically blurred image C. Image C is our final blurred image.
Convolving an image works as shown below. It consists of simple multiplications and additions.
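The multiply-and-add step can be sketched for a single row as follows. This is only an illustration, not the lab's gb_h(): the function name convolve_row, the use of float pixels, and the clamp-to-edge handling at the borders are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the lab code): convolve one row of
 * `width` pixels with a kernel of odd length `klen`. Reads that fall
 * outside the row are clamped to the nearest edge pixel, which is an
 * assumed boundary policy. */
void convolve_row(const float *in, float *out, size_t width,
                  const float *kernel, size_t klen)
{
    assert(width > 0 && klen % 2 == 1);
    long half = (long)(klen / 2);
    for (long x = 0; x < (long)width; x++) {
        float acc = 0.0f;
        for (long k = -half; k <= half; k++) {
            long src = x + k;
            if (src < 0) src = 0;                          /* clamp left edge  */
            if (src >= (long)width) src = (long)width - 1; /* clamp right edge */
            acc += in[src] * kernel[k + half];             /* multiply and add */
        }
        out[x] = acc;
    }
}
```

Each output pixel is a weighted sum of its neighbours in the same row, so the inner loop reads memory locations that sit next to each other.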
We provide an implementation of Gaussian blur here, and your job is to optimize the program without changing the algorithm. To keep things simple, you only need to focus on apply_gb_fast.c.
In apply_gb_fast.c, there is a function called apply_gb(). It takes two parameters: Image a is the input image and FVec gv is the kernel. It calls gb_h and gb_v to do the convolution horizontally and vertically; each of gb_h and gb_v returns a new image.
First, use make base_test to run the original version of the Gaussian blur, which reports the running times of gb_h and gb_v. You will find that there is a gap between the two times.
Then, to optimize the program, take another look at the execution order of the Gaussian blur. A vertical convolution is equivalent to a horizontal convolution applied to the transposed matrix. Thus, we can first transpose the image, apply the horizontal convolution to it, and finally transpose it again to get the correct result. In this way, we can improve the memory access performance of the program.
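To build some intuition for why access order matters, the sketch below counts how many cache lines a sequence of reads touches for a given stride. It is a simplified model, not a measurement: the 64-byte line size, the line-aligned starting address, and the function name lines_touched are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES 64 /* assumed cache line size; real hardware may differ */

/* Hypothetical model: count the distinct cache lines touched when reading
 * `count` floats that are `stride` elements apart, starting at element 0
 * of a line-aligned array. Addresses only increase, so a change of line
 * index always means a new line. */
size_t lines_touched(size_t count, size_t stride)
{
    assert(stride > 0);
    size_t lines = 0;
    size_t last_line = (size_t)-1; /* sentinel: no line seen yet */
    for (size_t i = 0; i < count; i++) {
        size_t line = (i * stride * sizeof(float)) / LINE_BYTES;
        if (line != last_line) {
            lines++;
            last_line = line;
        }
    }
    return lines;
}
```

Under these assumptions, walking along a row of a row-major image corresponds to stride 1, while walking down a column corresponds to a stride equal to the image width, so a column walk touches many more lines for the same number of pixels.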
In apply_gb_fast.c, there is a completed function transpose(), which will return a transposed image of the input image. You can use it to optimize your program following the method mentioned above.
You can run make all to test your modified program. The program test_accuracy will test the result of your program and output the average error between your result and the correct result.
To make the program even faster, we can apply cache blocking to the function transpose(), which you can learn about here.
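A cache-blocked transpose could look roughly like the sketch below. It works on a plain row-major float array rather than the lab's Image type, and the name transpose_blocked and the tile size of 8 are assumptions; the best tile size depends on the actual cache.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 8 /* assumed tile size; tune for the actual cache */

/* Hypothetical sketch (not the lab's transpose()): transpose a row-major
 * h x w matrix `src` into the w x h matrix `dst`, one BLOCK x BLOCK tile
 * at a time, so the lines loaded for a tile are reused before eviction. */
void transpose_blocked(const float *src, float *dst, size_t h, size_t w)
{
    assert(src != dst); /* the sketch does not support in-place use */
    for (size_t by = 0; by < h; by += BLOCK) {
        for (size_t bx = 0; bx < w; bx += BLOCK) {
            /* Clip the tile at the matrix edges. */
            size_t y_end = by + BLOCK < h ? by + BLOCK : h;
            size_t x_end = bx + BLOCK < w ? bx + BLOCK : w;
            for (size_t y = by; y < y_end; y++) {
                for (size_t x = bx; x < x_end; x++) {
                    dst[x * h + y] = src[y * w + x];
                }
            }
        }
    }
}
```

The result is the same as a naive double loop; only the order in which elements are visited changes.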
Checkoff: Show your program to your TA and answer the following questions:
1. Why is there a gap between gb_v and gb_h?
2. Why does the changed execution order achieve better performance even though we do more work (the transposes)?
Some data structures are cache friendly, while others cause a lot of cache misses. For programs whose workload is dominated by data access rather than computation, cache misses affect performance significantly.
Every time someone visits our website, the website log engine records some information, such as the IP address and state. The log engine then performs some operations on the recorded logs, and most of that work is accessing data. To simplify the situation, we provide a demo web log engine that traverses all logs and applies a map function to some of the information.
Your task is to modify the given struct log_entry in log_fast.c to make the data structure more cache friendly. You can use the following command to compare your program's performance with the original one. (If you use a virtual machine, you may need to increase its memory if a segmentation fault occurs.)
$ make all
In the function tranverse(), we only use three members of log_entry. However, the three members are separated by large arrays, which places them in three different cache lines. Thus, each access to one element of the array logs will cause 3 cache misses.
Checkoff: Show your result to your TA and explain why you made this modification.
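The effect of field order can be checked with offsetof. The two structs below are hypothetical stand-ins: the real log_entry in log_fast.c may have different field names, types, and sizes, and the sketch assumes 4-byte int, a 64-byte cache line, and an element that starts on a line boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: three small "hot" fields separated by large arrays. */
struct log_entry_slow {
    int  access_time;
    char url[128];
    int  status;
    char remote_addr[128];
    int  bytes_sent;
};

/* Hypothetical layout: the same fields with the hot ones grouped first. */
struct log_entry_fast {
    int  access_time;
    int  status;
    int  bytes_sent;
    char url[128];
    char remote_addr[128];
};

/* Index of the cache line that holds byte `offset` of a line-aligned struct. */
size_t line_of(size_t offset, size_t line_bytes)
{
    assert(line_bytes > 0);
    return offset / line_bytes;
}

/* Number of distinct cache lines holding the three given field offsets. */
int distinct_lines(size_t a, size_t b, size_t c, size_t line_bytes)
{
    size_t la = line_of(a, line_bytes);
    size_t lb = line_of(b, line_bytes);
    size_t lc = line_of(c, line_bytes);
    return 1 + (lb != la) + (lc != la && lc != lb);
}
```

Calling distinct_lines with the offsetof values of access_time, status, and bytes_sent for each struct shows how many lines one traversal step has to load under the stated assumptions.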